Dynamic Programming

By Rohit Pardasani

We often talk about two different tasks:

Policy Evaluation and Control

Policy Evaluation: the task of computing the state-value function $v_\pi$ for an arbitrary policy $\pi$. It is also referred to as the prediction problem.
Control: the task of finding a policy that obtains as much reward as possible, i.e. a policy that maximizes the value function. Control is the ultimate goal of reinforcement learning.


$\pi_2$ is strictly better than $\pi_1$ iff $v_{\pi_2}(s) \geq v_{\pi_1}(s)$ for every state $s$, and $v_{\pi_2}(s) > v_{\pi_1}(s)$ for at least one state.

In control, we repeatedly improve the policy to obtain a sequence of better and better policies, until no further improvement is possible. Once no better policy can be obtained, we say that we have found an optimal policy.

Dynamic programming uses the Bellman equations, the Bellman optimality equations, and knowledge of the dynamics $p(s', r \mid s, a)$ to solve policy evaluation and control.

Iterative Policy Evaluation
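As a minimal sketch, iterative policy evaluation repeatedly sweeps through the states, replacing each $V(s)$ with its Bellman-equation estimate under the fixed policy $\pi$, until the largest change in a sweep falls below a threshold. The MDP below (a hypothetical three-state chain with actions `"right"` and `"stay"`, state 2 terminal) is purely illustrative, not from the text:

```python
# Iterative policy evaluation on a tiny hypothetical chain MDP.
# transitions[s][a] is a list of (probability, next_state, reward) tuples.
transitions = {
    0: {"right": [(1.0, 1, 0.0)], "stay": [(1.0, 0, 0.0)]},
    1: {"right": [(1.0, 2, 1.0)], "stay": [(1.0, 1, 0.0)]},
}
policy = {0: "right", 1: "right"}   # the arbitrary policy pi to evaluate
gamma, theta = 0.9, 1e-8            # discount factor, convergence threshold

V = {0: 0.0, 1: 0.0, 2: 0.0}        # terminal state 2 stays at 0
while True:
    delta = 0.0
    for s in policy:                # systematic sweep over non-terminal states
        v_new = sum(p * (r + gamma * V[s2])
                    for p, s2, r in transitions[s][policy[s]])
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < theta:               # stop once a sweep changes V very little
        break
```

Here $V$ converges to $v_\pi$ for this chain: state 1 is one rewarding step from the terminal state, and state 0 is one discounted step behind it.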



Policy Improvement
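Policy improvement can be sketched as a one-step lookahead: for each state, compute the action values $q(s,a)$ from a given $V$ and pick the greedy action. The chain MDP and the values below are illustrative assumptions (the values are those of the all-`"right"` policy from the evaluation sketch):

```python
# Greedy policy improvement with respect to a given state-value function V,
# on a hypothetical chain MDP (all names here are illustrative).
transitions = {
    0: {"right": [(1.0, 1, 0.0)], "stay": [(1.0, 0, 0.0)]},
    1: {"right": [(1.0, 2, 1.0)], "stay": [(1.0, 1, 0.0)]},
}
gamma = 0.9
V = {0: 0.9, 1: 1.0, 2: 0.0}        # assumed values of the current policy

def q(s, a):
    # One-step lookahead: expected return of taking a in s, then following V.
    return sum(p * (r + gamma * V[s2]) for p, s2, r in transitions[s][a])

greedy = {s: max(transitions[s], key=lambda a: q(s, a)) for s in transitions}
```

By the policy improvement theorem, the greedy policy is at least as good as the policy whose values $V$ describes.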

Policy Iteration
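Policy iteration alternates the two steps above: evaluate the current policy to convergence, then greedify; stop when greedification no longer changes the policy. The following sketch runs on the same hypothetical chain MDP, starting from the deliberately bad all-`"stay"` policy:

```python
# Policy iteration = repeated (evaluation to convergence, then improvement),
# on a hypothetical chain MDP (names and rewards are illustrative).
transitions = {
    0: {"right": [(1.0, 1, 0.0)], "stay": [(1.0, 0, 0.0)]},
    1: {"right": [(1.0, 2, 1.0)], "stay": [(1.0, 1, 0.0)]},
}
gamma, theta = 0.9, 1e-8
policy = {0: "stay", 1: "stay"}     # start from a poor policy
V = {0: 0.0, 1: 0.0, 2: 0.0}

def q(s, a, V):
    return sum(p * (r + gamma * V[s2]) for p, s2, r in transitions[s][a])

stable = False
while not stable:
    # 1) policy evaluation, run to convergence
    while True:
        delta = 0.0
        for s in policy:
            v_new = q(s, policy[s], V)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # 2) greedy policy improvement; stop when the policy no longer changes
    stable = True
    for s in policy:
        best = max(transitions[s], key=lambda a: q(s, a, V))
        if q(s, best, V) > q(s, policy[s], V) + 1e-12:  # tolerance avoids tie flips
            policy[s] = best
            stable = False
```

On this chain the loop terminates with the optimal all-`"right"` policy and its values.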

Flexibility of Policy Iteration Framework

We need not run policy evaluation to completion, and we need not carry greedification through to a fully greedy policy.
In value iteration, we do not run policy evaluation to completion: we run just a single sweep of policy evaluation through all the states, followed by greedification with respect to the state values.
In value iteration, just as in policy iteration, a sweep is performed to update all the states.
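Collapsing the single evaluation sweep and the greedification into one update gives the Bellman optimality backup: each sweep replaces $V(s)$ with the maximum action value. A sketch on the same hypothetical chain MDP:

```python
# Value iteration: each sweep applies max-over-actions (Bellman optimality)
# backups; the greedy policy is read off once at the end.
# The chain MDP below is illustrative, not from the text.
transitions = {
    0: {"right": [(1.0, 1, 0.0)], "stay": [(1.0, 0, 0.0)]},
    1: {"right": [(1.0, 2, 1.0)], "stay": [(1.0, 1, 0.0)]},
}
gamma, theta = 0.9, 1e-8
V = {0: 0.0, 1: 0.0, 2: 0.0}

def q(s, a):
    return sum(p * (r + gamma * V[s2]) for p, s2, r in transitions[s][a])

while True:
    delta = 0.0
    for s in transitions:           # one sweep; max replaces full evaluation
        v_new = max(q(s, a) for a in transitions[s])
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < theta:
        break

# greedify once with respect to the converged state values
policy = {s: max(transitions[s], key=lambda a: q(s, a)) for s in transitions}
```

Note the contrast with policy iteration: there is no inner evaluation loop, only repeated sweeps of the max-backup.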
Synchronous DP: all states are updated in a fixed order, i.e. a systematic sweep through all the states is performed.
Asynchronous DP: states are not updated in any fixed order, so no systematic sweep through all the states is performed, and some states may be updated more often than others. This is helpful when the state space is large, but it must be ensured that every state is still updated at some point.

Efficiency of Dynamic Programming

We will compare dynamic programming with two alternative approaches:
1. Monte Carlo Method
2. Brute Force Search